Efficient Sketches for the Set Query Problem*
Abstract

We develop an algorithm for estimating the values of a vector
$x \in \mathbb{R}^n$ over a support $S$ of size $k$ from a randomized sparse
binary linear sketch $Ax$ of size $O(k)$. Given $Ax$ and $S$, we can
recover $x'$ with $\|x' - x_S\|_2 \le \epsilon \|x - x_S\|_2$ with probability
at least $1 - k^{-\Omega(1)}$. The recovery takes $O(k)$ time.

While interesting in its own right, this primitive also has a 
number of applications. For example, we can: 

1. Improve the linear $k$-sparse recovery of heavy hitters in
Zipfian distributions with $O(k \log n)$ space from a $1 + \epsilon$
approximation to a $1 + o(1)$ approximation, giving the
first such approximation in $O(k \log n)$ space when $k \le O(n^{1-\epsilon})$.

2. Recover block-sparse vectors with $O(k)$ space and a $1 + \epsilon$
approximation. Previous algorithms required either $\omega(k)$
space or $\omega(1)$ approximation.



1 Introduction 

In recent years, a new "linear" approach for obtaining
a succinct approximate representation of $n$-dimensional
vectors (or signals) has been discovered. For any signal $x$,
the representation is equal to $Ax$, where $A$ is an $m \times n$
matrix, or possibly a random variable chosen from some
distribution over such matrices. The vector $Ax$ is often
referred to as the measurement vector or linear sketch of $x$.
Although $m$ is typically much smaller than $n$, the sketch
$Ax$ often contains plenty of useful information about the
signal $x$.

A particularly useful and well-studied problem is that
of stable sparse recovery. The problem is typically defined
as follows: for some norm parameters $p$ and $q$ and an
approximation factor $C > 0$, given $Ax$, recover a vector
$x'$ such that

(1)   $\|x' - x\|_p \le C \cdot \mathrm{Err}_q(x, k)$,   where   $\mathrm{Err}_q(x, k) = \min_{k\text{-sparse } \hat{x}} \|x - \hat{x}\|_q$,

where we say that $\hat{x}$ is $k$-sparse if it has at most $k$ non-zero
coordinates. Sparse recovery has applications to numerous
areas such as data stream computing [Mut03, Ind07]
and compressed sensing [CRT06, Don06], notably for
constructing imaging systems that acquire images directly in
compressed form (e.g., [DDT+08, Rom09]). The problem
has been a subject of extensive study over the last several
years, with the goal of designing schemes that enjoy
good "compression rate" (i.e., low values of $m$) as well
as good algorithmic properties (i.e., low encoding and
recovery times). It is known that there exist distributions
of matrices $A$ and associated recovery algorithms
that for any $x$ with high probability produce approximations
$x'$ satisfying Equation (1) with $\ell_p = \ell_q = \ell_2$,
constant approximation factor $C = 1 + \epsilon$, and sketch length
$m = O(k \log(n/k))$;¹ it is also known that this sketch
length is asymptotically optimal [DIPW10, FPRU10].
Similar results for other combinations of $\ell_p/\ell_q$ norms are
known as well.
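As a concrete reading of Equation (1), the benchmark $\mathrm{Err}_q(x, k)$ is simply the $\ell_q$ norm of $x$ after its $k$ largest-magnitude coordinates are removed, since the best $k$-sparse approximation keeps exactly those coordinates. The following is a minimal sketch of that computation; the function name `err_q` is illustrative and not taken from the paper.

```python
def err_q(x, k, q=2.0):
    """Return Err_q(x, k): the l_q norm of x outside its k largest-magnitude entries."""
    magnitudes = sorted((abs(v) for v in x), reverse=True)
    tail = magnitudes[k:]  # everything except the k largest magnitudes
    return sum(v ** q for v in tail) ** (1.0 / q)


if __name__ == "__main__":
    x = [3.0, 4.0, 0.0, 12.0]
    # The best 1-sparse approximation keeps the entry 12; the tail is (3, 4, 0).
    print(err_q(x, k=1, q=2.0))
```

A recovery scheme satisfies Equation (1) when its output is within a factor $C$ of this value.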

Because it is impossible to improve on the sketch
size in the general sparse recovery problem, recently
there has been a large body of work on more restricted
problems that are amenable to more efficient
solutions. This includes model-based compressive sensing
[BCDH10], which imposes additional constraints (or
models) on $x$ beyond near-sparsity. Examples of models
include block sparsity, where the large coefficients tend to
cluster together in blocks [BCDH10, EKB09]; tree sparsity,
where the large coefficients form a rooted, connected
tree structure [BCDH10, LD05]; and being Zipfian, where
we require that the histogram of coefficient size follow a
Zipfian (or power law) distribution.

A sparse recovery algorithm needs to perform two 
tasks: locating the large coefficients of x and estimating 
their value. Existing algorithms perform both tasks at 
the same time. In contrast, we propose decoupling these 
tasks. In models of interest, including Zipfian signals and 
block-sparse signals, existing techniques can locate the 
large coefficients more efficiently or accurately than they 
can estimate them. Prior to this work, however, estimat- 
ing the large coefficients after finding them had no better 
solution than the general sparse recovery problem. We 
fill this gap by giving an optimal method for estimating 
the values of the large coefficients after locating them. 



1 In particular, a random Gaussian matrix [CD04] or a random
sparse binary matrix ([GLPS09], building on [CCF02, CM04]) has
this property with overwhelming probability. See [GI10] for an
overview.



We refer to this task as the Set Query Problem.²

Main result. (Set Query Algorithm.) We give a randomized
distribution over $O(k) \times n$ binary matrices $A$
such that, for any vector $x \in \mathbb{R}^n$ and set $S \subseteq \{1, \dots, n\}$
with $|S| = k$, we can recover an $x'$ from $Ax + \nu$ and $S$
with

$\|x' - x_S\|_2 \le \epsilon(\|x - x_S\|_2 + \|\nu\|_2)$

where $x_S \in \mathbb{R}^n$ equals $x$ over $S$ and zero elsewhere. The
matrix $A$ has $O(1)$ non-zero entries per column, recovery
succeeds with probability $1 - k^{-\Omega(1)}$, and recovery takes
$O(k)$ time. This can be achieved for arbitrarily small
$\epsilon > 0$, using $O(k/\epsilon^2)$ rows. We achieve a similar result in
the $\ell_1$ norm.

The set query problem is useful in scenarios when, 
given a sketch of x, we have some alternative methods for 
discovering a "good" support of an approximation to x. 
This is the case, e.g., in block-sparse recovery, where (as 
we show in this paper) it is possible to identify "heavy" 
blocks using other methods. It is also a natural problem
in itself. In particular, it generalizes the well-studied
point query problem [CM04], which considers the case
that $S$ is a singleton. We note that, although the set
query problem for sets of size $k$ can be reduced to $k$
instances of the point query problem, this reduction is less
space-efficient than the algorithm we propose, as elaborated
below.

Techniques. Our method is related to existing sparse
recovery algorithms, including Count-Sketch [CCF02]
and Count-Min [CM04]. In fact, our sketch matrix $A$
is almost identical to the one used in Count-Sketch:
each column of $A$ has $d$ random locations out of $O(kd)$
each independently set to $\pm 1$, and the columns are
independently generated. We can view such a matrix as
"hashing" each coordinate to $d$ "buckets" out of $O(kd)$.
The difference is that the previous algorithms require
$O(k \log k)$ measurements to achieve our error bound (and
$d = O(\log k)$), while we only need $O(k)$ measurements
and $d = O(1)$.
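The construction just described can be sketched in a few lines. This is an illustrative sketch only: the names `sample_sketch_columns` and `apply_sketch`, the list-of-pairs representation of a column, and the use of a seeded generator are assumptions made for the example, not details from the paper.

```python
import random


def sample_sketch_columns(n, w, d, seed=0):
    """For each of n coordinates, pick d distinct rows out of w and a random sign for each."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n):
        rows = rng.sample(range(w), d)  # d distinct "buckets"
        columns.append([(q, rng.choice((-1, 1))) for q in rows])
    return columns


def apply_sketch(columns, x, w):
    """Compute Ax for the sparse signed matrix whose column i is columns[i]."""
    b = [0.0] * w
    for i, xi in enumerate(x):
        if xi != 0:
            for q, sign in columns[i]:
                b[q] += sign * xi
    return b


if __name__ == "__main__":
    cols = sample_sketch_columns(n=50, w=40, d=3)
    x = [0.0] * 50
    x[7] = 2.0
    print([v for v in apply_sketch(cols, x, 40) if v != 0])
```

Storing only the $d$ non-zero positions of each column keeps the cost of sketching proportional to $d$ times the number of non-zero coordinates of $x$.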

We overcome two obstacles to bring $d$ down to $O(1)$
and still achieve the error bound with high probability.³
First, in order to estimate the coordinates $x_i$, we need
a more elaborate method than, say, taking the median
of the buckets that $i$ was hashed into. This is because,
with constant probability, all such buckets might contain
some other elements from $S$ (be "heavy") and therefore
using any of them as an estimator for $x_i$ would result in
too much error. Since, for super-constant values of $|S|$, it
is highly likely that such an event will occur for at least
one $i \in S$, it follows that this type of estimation results
in large error.



We solve this issue by using our knowledge of S. We 
know when a bucket is "corrupted" (that is, contains 
more than one element of S) , so we only estimate coordi- 
nates that lie in a large number of uncorrupted buckets. 
Once we estimate a coordinate, we subtract our estima- 
tion of its value from the buckets it is contained in. This 
potentially decreases the number of corrupted buckets, 
allowing us to estimate more coordinates. We show that, 
with high probability, this procedure can continue until 
it estimates every coordinate in $S$.

The other issue with the previous algorithms is that 
their analysis of their probability of success does not de- 
pend on k. This means that, even if the "head" did not 
interfere, their chance of success would be a constant (like 
$1 - 2^{-\Omega(d)}$) rather than high probability in $k$ (meaning
$1 - k^{-\Omega(d)}$). We show that the errors in our estimates
of coordinates have low covariance, which allows us to 
apply Chebyshev's inequality to get that the total error 
is concentrated around the mean with high probability. 

Related work. A similar recovery algorithm (with 
d = 2) has been analyzed and applied in a streaming 
context in [EG07]. However, in that paper the authors
only consider the case where the vector $y$ is $k$-sparse. In
that case, the termination property alone suffices, since 
there is no error to bound. Furthermore, because d = 2 
they only achieve a constant probability of success. In 
this paper we consider general vectors y so we need to 
make sure the error remains bounded, and we achieve a 
high probability of success. 

The recovery procedure also has similarities to recover- 
ing LDPCs using belief propagation, especially over the 
binary erasure channel. The similarities are strongest for 
exact recovery of $k$-sparse $y$; our method for bounding
the error from noise is quite different. 

Applications. Our efficient solution to the set query 
problem can be combined with existing techniques to 
achieve sparse recovery under several models. 

We say that a vector $x$ follows a Zipfian or power law
distribution with parameter $\alpha$ if $|x_{r(i)}| = \Theta(|x_{r(1)}| i^{-\alpha})$
where $r(i)$ is the location of the $i$th largest coefficient
in $x$. When $\alpha > 1/2$, $x$ is well approximated in the $\ell_2$
norm by its sparse approximation. Because a wide variety
of real world signals follow power law distributions
([Mit04, BKM+00]), this notion (related to "compressibility"⁴)
is often considered to be much of the reason
why sparse recovery is interesting [CT06, Cev08]. Prior
to this work, sparse recovery of power law distributions
has only been solved via general sparse recovery methods:
$(1 + \epsilon)\mathrm{Err}_2(x, k)$ error in $O(k \log(n/k))$ measurements.

2 The term "set query" is in contrast to "point query," used in
e.g. [CM04] for estimation of a single coordinate.

3 In this paper, "high probability" means probability at least
$1 - 1/k^c$ for some constant $c > 0$.

4 A signal is "compressible" when $|x_{r(i)}| = O(|x_{r(1)}| i^{-\alpha})$ rather
than $\Theta(|x_{r(1)}| i^{-\alpha})$ [CT06]. This allows it to decay very quickly
then stop decaying for a while; we require that the decay be
continuous.

However, locating the large coefficients in a power law
distribution has long been easier than in a general
distribution. Using $O(k \log n)$ measurements, the
Count-Sketch algorithm [CCF02] can produce a candidate set
$S \subseteq \{1, \dots, n\}$ with $|S| = O(k)$ that includes all of the top
$k$ positions in a power law distribution with high
probability (if $\alpha > 1/2$). We can then apply our set query
algorithm to recover an approximation $x'$ to $x_S$. Because
we already are using $O(k \log n)$ measurements on
Count-Sketch, we use $O(k \log n)$ rather than $O(k)$ measurements
in the set query algorithm to get an $\epsilon/\sqrt{\log n}$ rather than
$\epsilon$ approximation. This lets us recover a $k$-sparse $x'$ with
$O(k \log n)$ measurements with

$\|x' - x\|_2 \le \left(1 + \frac{\epsilon}{\sqrt{\log n}}\right) \mathrm{Err}_2(x, k).$



This is especially interesting in the common regime where
$k < n^{1-c}$ for some constant $c > 0$. Then no previous
algorithms achieve better than a $(1 + \epsilon)$ approximation
with $O(k \log n)$ measurements, and the lower bound
in [DIPW10] shows that any $O(1)$ approximation requires
$\Omega(k \log n)$ measurements.⁵ This means at $O(k \log n)$
measurements, the best approximation changes from
$\omega(1)$ to $1 + o(1)$.

Another application is that of finding block-sparse
approximations. In this application, the coordinate set
$\{1, \dots, n\}$ is partitioned into $n/b$ blocks, each of length
$b$. We define a $(k, b)$-block-sparse vector to be a vector
where all non-zero elements are contained in at most $k/b$
blocks. An example of block-sparse data is time series
data from $n/b$ locations over $b$ time steps, where only
$k/b$ locations are "active". We can define

$\mathrm{Err}_2(x, k, b) = \min_{(k,b)\text{-block-sparse } \hat{x}} \|x - \hat{x}\|_2.$
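The block-sparse benchmark can be computed directly: the best $(k, b)$-block-sparse approximation keeps the $k/b$ blocks of largest $\ell_2$ mass, so the error is the $\ell_2$ norm of the remaining blocks. A minimal sketch follows, assuming that $b$ divides both $n$ and $k$ and that blocks are contiguous; the function name `block_err_2` is illustrative.

```python
def block_err_2(x, k, b):
    """l_2 distance from x to the nearest (k, b)-block-sparse vector."""
    if len(x) % b != 0 or k % b != 0:
        raise ValueError("this sketch assumes b divides both len(x) and k")
    # Squared l_2 mass of each contiguous length-b block.
    energies = [sum(v * v for v in x[s:s + b]) for s in range(0, len(x), b)]
    energies.sort(reverse=True)
    # Keep the k/b heaviest blocks; whatever is left is the approximation error.
    return sum(energies[k // b:]) ** 0.5


if __name__ == "__main__":
    x = [3.0, 4.0, 0.0, 1.0, 5.0, 12.0, 2.0, 0.0]
    print(block_err_2(x, k=2, b=2))  # one block of length 2 is kept
```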



The block-sparse recovery problem can now be formulated
analogously to Equation (1). Since the formulation
imposes restrictions on the sparsity patterns, it is natural
to expect that one can perform sparse recovery from
fewer than the $O(k \log(n/k))$ measurements needed in the
general case. For that reason, and because of the prevalence
of approximately block-sparse signals, the problem of stable
recovery of variants of block-sparse approximations
has recently been a subject of extensive research (e.g.,
see [EB09, SPH09, BCDH10, CIHB09]). The state of the
art algorithm has been given in [BCDH10], who gave a
probabilistic construction of a single $m \times n$ matrix $A$, with
$m = O(k + \frac{k}{b} \log n)$, and an $n \log^{O(1)} n$-time algorithm for
performing the block-sparse recovery in the $\ell_1$ norm (as
well as other variants). If the blocks have size $\Omega(\log n)$,
the algorithm uses only $O(k)$ measurements, which is a
substantial improvement over the general bound. However,
the approximation factor $C$ guaranteed by that
algorithm was super-constant.

5 The lower bound only applies to geometric distributions, not
Zipfian ones. However, our algorithm applies to more general
sub-Zipfian distributions (defined in Section 4.1), which includes both.

In this paper, we provide a distribution over matrices
$A$, with $m = O(k + \frac{k}{b} \log n)$, which enables solving
this problem with a constant approximation factor and
in the $\ell_2$ norm, with high probability. As with Zipfian
distributions, first one algorithm tells us where to find
the heavy hitters and then the set query algorithm
estimates their values. In this case, we modify the algorithm
of [ABI08] to find block heavy hitters, which enables us to
find the support of the $\frac{k}{b}$ "most significant blocks" using
$O(\frac{k}{b} \log n)$ measurements. The essence is to perform
dimensionality reduction of each block from $b$ to $O(\log n)$
dimensions, then estimate the result with a linear hash
table. For each block, most of the projections are
estimated pretty well, so the median is a good estimator of
the block's norm. Once the support is identified, we can
recover the coefficients using the set query algorithm.



2 Preliminaries 



2.1 Notation 

For $n \in \mathbb{Z}^+$, we denote $\{1, \dots, n\}$ by $[n]$. Suppose $x \in \mathbb{R}^n$.
Then for $i \in [n]$, $x_i \in \mathbb{R}$ denotes the value of the
$i$-th coordinate in $x$. As an exception, $e_i \in \mathbb{R}^n$ denotes
the elementary unit vector with a one at position $i$. For
$S \subseteq [n]$, $x_S$ denotes the vector $x' \in \mathbb{R}^n$ given by $x'_i = x_i$
if $i \in S$, and $x'_i = 0$ otherwise. We use $\mathrm{supp}(x)$ to denote
the support of x. We use upper case letters to denote 
sets, matrices, and random distributions. We use lower 
case letters for scalars and vectors. 



2.2 Negative Association 

This paper would like to make a claim of the form "We 
have k observations each of whose error has small expec- 
tation and variance. Therefore the average error is small 
with high probability in $k$." If the errors were independent
this would be immediate from Chebyshev's inequality,
but our errors depend on each other. Fortunately, our
errors have some tendency to behave even better than if 
they were independent: the more noise that appears in 
one coordinate, the less remains to land in other coordi- 
nates. We use negative dependence to refer to this general 
class of behavior. The specific forms of negative depen- 
dence we use are negative association and approximate 
negative correlation; see Appendix A for details on these
notions. 



3 Set-Query Algorithm 

Theorem 3.1. There is a randomized sparse binary
sketch matrix $A$ and recovery algorithm $\mathcal{A}$, such that for
any $x \in \mathbb{R}^n$, $S \subseteq [n]$ with $|S| = k$, $x' = \mathcal{A}(Ax + \nu, S) \in \mathbb{R}^n$
has $\mathrm{supp}(x') \subseteq S$ and

$\|x' - x_S\|_2 \le \epsilon(\|x - x_S\|_2 + \|\nu\|_2)$

with probability at least $1 - 1/k^c$. $A$ has $O(\frac{c}{\epsilon^2} k)$ rows and
$O(c)$ non-zero entries per column, and $\mathcal{A}$ runs in $O(ck)$
time.

One can achieve $\|x' - x_S\|_1 \le \epsilon(\|x - x_S\|_1 + \|\nu\|_1)$ under
the same conditions, but with only $O(\frac{c}{\epsilon} k)$ rows.



We will first show Theorem 3.1 for a constant $c = 1/3$
rather than for general $c$. Parallel repetition gives the
theorem for general $c$, as described in Section 3.7. We
will also only show it with entries of $A$ being in $\{0, 1, -1\}$.
By splitting each row in two, one for the positive and one
for the negative entries, we get a binary matrix with the
same properties. The paper focuses on the more difficult
$\ell_2$ result; see Appendix B for details on the $\ell_1$ result.

3.1 Intuition 

We call $x_S$ the "head" and $x - x_S$ the "tail." The head
probably contains the heavy hitters, with much more
mass than the tail of the distribution. We would like
to estimate $x_S$ with zero error from the head and small
error from the tail with high probability.

Our algorithm is related to the standard Count-Sketch
[CCF02] and Count-Min [CM04] algorithms. In
order to point out the differences, let us examine how
they perform on this task. These algorithms show that
hashing into a single $w = O(k)$ sized hash table is good
in the sense that each point $x_i$ has:

1. Zero error from the head with constant probability
(namely $1 - \frac{k}{w}$).

2. A small amount of error from the tail in expectation
(and hence with constant probability).

They then iterate this procedure $d$ times and take the
median, so that each estimate has small error with
probability $1 - 2^{-\Omega(d)}$. With $d = O(\log k)$, we get that all
$k$ estimates in $S$ are good with $O(k \log k)$ measurements
with high probability in $k$. With fewer measurements,
however, some $x_i$ will probably have error from the head.
If the head is much larger than the tail (such as when
the tail is zero), this is a major problem. Furthermore,
with $O(k)$ measurements the error from the tail would be
small only in expectation, not with high probability.
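The hash-and-take-the-median scheme described above can be illustrated with a small point-query structure in the spirit of Count-Sketch. This is a simplified sketch for intuition, not the construction of [CCF02]: the class name, the fully random (rather than limited-independence) hash choices, and the explicit per-coordinate tables are all assumptions of the example.

```python
import random
from statistics import median


class CountSketchLike:
    """d independent hash tables of width w; a point query is the median of d signed buckets."""

    def __init__(self, n, w, d, seed=0):
        rng = random.Random(seed)
        self.d = d
        # One bucket index and one sign per coordinate, for each of the d repetitions.
        self.bucket = [[rng.randrange(w) for _ in range(n)] for _ in range(d)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(d)]
        self.table = [[0.0] * w for _ in range(d)]

    def update(self, i, value):
        """Add `value` to coordinate i of the sketched vector."""
        for t in range(self.d):
            self.table[t][self.bucket[t][i]] += self.sign[t][i] * value

    def query(self, i):
        """Estimate x_i as the median over the d tables of the signed bucket contents."""
        return median(
            self.sign[t][i] * self.table[t][self.bucket[t][i]] for t in range(self.d)
        )


if __name__ == "__main__":
    cs = CountSketchLike(n=100, w=50, d=5)
    cs.update(3, 4.0)
    print(cs.query(3))
```

Each repetition is wrong only when coordinate $i$ collides with heavy mass in its bucket, and the median tolerates a minority of such repetitions; this is why $d$ must grow with $\log k$ if all $k$ point queries are to succeed simultaneously.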

We make three observations that allow us to use only
$O(k)$ measurements to estimate $x_S$ with error relative to
the tail with high probability in $k$.

1. The total error from the tail over a support of size
$k$ is concentrated more strongly than the error at a
single point: the error probability drops as $k^{-\Omega(d)}$
rather than $2^{-\Omega(d)}$.

2. The error from the head can be avoided if one knows 
where the head is, by modifying the recovery algo- 
rithm. 

3. The error from the tail remains concentrated after 
modifying the recovery algorithm. 

For simplicity this paper does not directly show (1),
only (2) and (3). The modification to the algorithm to
achieve (2) is quite natural, and described in detail and
illustrated in Section 3.2. Rather than estimate every
coordinate in $S$ immediately, we only estimate those
coordinates which mostly do not overlap with other coordinates
in $S$. In particular, we only estimate $x_i$ as the median
of at least $d - 2$ positions that are not in the image of
$S \setminus \{i\}$. Once we learn $x_i$, we can subtract $A x_i e_i$ from
the observed $Ax$ and repeat on $A(x - x_i e_i)$ and $S \setminus \{i\}$.
Because we only look at positions that are in the image
of only one remaining element of $S$, this avoids any error
from the head. We show in Section 3.3 that this
algorithm never gets stuck; we can always find some position
that mostly doesn't overlap with the image of the rest of
the remaining support.

We then show that the error from the tail has low
expectation, and that it is strongly concentrated. We think
of the tail as noise located in each "cell" (coordinate in
the image space). We decompose the error of our result
into two parts: the "point error" and the "propagation".
The point error is error introduced in our estimate of
some $x_i$ based on noise in the cells that we estimate $x_i$
from, and equals the median of the noise in those cells.
The "propagation" is the error that comes from point
error in estimating other coordinates in the same
connected component; these errors propagate through the
component as we subtract off incorrect estimates of each
$x_i$.

Section 3.4 shows how to decompose the total error in
terms of point errors and the component sizes. The two
following sections bound the expectation and variance of
these two quantities and show that they obey some
notions of negative dependence. We combine these errors in
Section 3.7 to get Theorem 3.1 with a specific $c$ (namely
$c = 1/3$). We then use parallel repetition to achieve
Theorem 3.1 for arbitrary $c$.



3.2 Algorithm 

We describe the sketch matrix $A$ and recovery procedure
in Algorithm 3.1. Unlike Count-Sketch [CCF02] or
Count-Min [CM04], our $A$ is not split into $d$ hash tables
of size $O(k)$. Instead, it has a single $w = O(d^2 k/\epsilon^2)$ sized
hash table where each coordinate is hashed into $d$ unique
positions. We can think of $A$ as a random $d$-uniform
hypergraph, where the non-zero entries in each column
correspond to the terminals of a hyperedge. We say that
$A$ is drawn from $\mathbb{G}^d(w, n)$ with random signs associated
with each (hyperedge, terminal) pair. We do this so we
will be able to apply existing theorems on random
hypergraphs.

Figure 1: An instance of the set query problem. There are
$n$ vertices on the left, corresponding to $x$, and the table
on the right represents $Ax$. Each vertex $i$ on the left maps
to $d$ cells on the right, randomly increasing or decreasing
the value in each cell by $x_i$. We represent addition by
black lines, and subtraction by red lines. We are told the
locations of the heavy hitters, which we represent by blue
circles; the rest is represented by yellow circles.

Figure 1 shows an example $Ax$ for a given $x$, and
Figure 2 demonstrates running the recovery procedure on
this instance.



Lemma 3.1. Algorithm 3.1 runs in time $O(dk)$.

Proof. $A$ has $d$ entries per column. For each of the at
most $dk$ rows $q$ in the image of $S$, we can store the
preimages $P(q)$. We also keep track of the sets of possible next
hyperedges, $J_i = \{j \mid |L_j| \ge d - i\}$ for $i \in \{1, 2\}$. We can
compute these in an initial pass in $O(dk)$. Then in each
iteration, we remove an element $j \in J_1$ or $J_2$ and update
$x'_j$, $b$, and $T$ in $O(d)$ time. We then look at the two or
fewer non-isolated vertices $q$ in hyperedge $j$, and remove
$j$ from the associated $P(q)$. If this makes $|P(q)| = 1$,
we check whether to insert the element in $P(q)$ into the
$J_i$. Hence the inner loop takes $O(d)$ time, for $O(dk)$
total. □
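The recovery procedure of Algorithm 3.1 can be sketched in Python as follows. This is a direct, unoptimized transcription for illustration: it recomputes $P(q)$ and $L_j$ from scratch in every round, so it does not achieve the $O(dk)$ running time of Lemma 3.1, which relies on the incremental bookkeeping described in the proof. The function name `set_query`, the representation of column $j$ of $A$ as a list of `(row, sign)` pairs, and the assumption $d \ge 3$ (so that at least one isolated cell is always available for the median) are choices made for this example.

```python
import random
from statistics import median


def set_query(columns, S, b, d, rng=None):
    """Estimate x over S from b = Ax + noise by peeling hyperedges with isolated cells.

    columns[j] lists the (row, sign) pairs of the d non-zero entries in column j of A.
    Returns {j: estimate of x_j}, or None if the procedure aborts. Assumes d >= 3.
    """
    rng = rng or random.Random(0)
    b = list(b)  # work on a copy of the sketch
    T = set(S)
    estimate = {}
    while T:
        # P[q]: number of remaining hyperedges in T that contain cell q.
        P = {}
        for j in T:
            for q, _ in columns[j]:
                P[q] = P.get(q, 0) + 1
        # L[j]: isolated cells of hyperedge j (cells touched by no other element of T).
        L = {j: [(q, s) for q, s in columns[j] if P[q] == 1] for j in T}
        # Prefer hyperedges with at least d-1 isolated cells, then fall back to d-2.
        candidates = [j for j in T if len(L[j]) >= d - 1]
        if not candidates:
            candidates = [j for j in T if len(L[j]) >= d - 2]
        if not candidates:
            return None  # abort
        j = rng.choice(sorted(candidates))
        estimate[j] = median(s * b[q] for q, s in L[j])
        for q, s in columns[j]:  # b <- b - x'_j A e_j
            b[q] -= s * estimate[j]
        T.remove(j)
    return estimate
```

With no tail and no noise, every isolated cell of hyperedge $j$ holds exactly $\pm x_j$, so the procedure recovers $x_S$ exactly whenever it does not abort.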





Figure 2: Example run of the algorithm. Part (a) shows 
the state as considered by the algorithm: Ax and the 
graph structure corresponding to the given support. In 
part (b), the algorithm chooses a hyperedge with at least 
d — 2 isolated vertices and estimates the value as the 
median of those isolated vertices multiplied by the sign 
of the corresponding edge. In part (c), the image of the 
first vertex has been removed from Ax and we repeat on 
the smaller graph. We continue until the entire support 
has been estimated, as in part (d). 



3.3 Exact Recovery 

Definition of sketch matrix A. For a constant $d$, let
$A$ be a $w \times n = O(\frac{d^2}{\epsilon^2} k) \times n$ matrix where each column
is chosen independently uniformly at random over all
exactly $d$-sparse columns with entries in $\{-1, 0, 1\}$. We can
think of $A$ as the incidence matrix of a random $d$-uniform
hypergraph with random signs.
Recovery procedure.
procedure SetQuery(A, S, b)    ▷ Recover approximation $x'$ to $x_S$ from $b = Ax + \nu$
    $T \leftarrow S$
    while $|T| > 0$ do
        Define $P(q) = \{j \mid A_{qj} \ne 0, j \in T\}$ as the set
            of hyperedges in $T$ that contain $q$.
        Define $L_j = \{q \mid A_{qj} \ne 0, |P(q)| = 1\}$ as the
            set of isolated vertices in hyperedge $j$.
        Choose a random $j \in T$ such that $|L_j| \ge d - 1$.
            If this is not possible, find a random $j \in T$ such that
            $|L_j| \ge d - 2$. If neither is possible, abort.
        $x'_j \leftarrow \mathrm{median}_{q \in L_j} A_{qj} b_q$
        $b \leftarrow b - x'_j A e_j$
        $T \leftarrow T \setminus \{j\}$
    end while
    return $x'$
end procedure

Algorithm 3.1: Recovering a signal given its support.

The random hypergraph $\mathbb{G}^d(w, k)$ of $k$ random $d$-uniform
hyperedges on $w$ vertices is well studied in [KL02]. We
use their results to show that the algorithm successfully
terminates with high probability, and that most
hyperedges are chosen with at least $d - 1$ isolated vertices:

Lemma 3.2. With probability at least $1 - O(1/k)$,
Algorithm 3.1 terminates without aborting. Furthermore,
in each component at most one hyperedge is chosen with
only $d - 2$ isolated vertices.

We will show this by building up a couple of lemmas. We
define a connected hypergraph $H$ with $r$ vertices on $s$
hyperedges to be a hypertree if $r = s(d - 1) + 1$ and to
be unicyclic if $r = s(d - 1)$. Then Theorem 4 of [KL02]
shows that, if the graph is sufficiently sparse, $\mathbb{G}^d(w, k)$ is
probably composed entirely of hypertrees and unicyclic
components. The precise statement is as follows:⁶



Lemma 3.3 (Theorem 4 of [KL02]). Let $m = w/(d(d - 1)) - k$.
Then with probability $1 - O(d^5 w^2/m^3)$, $\mathbb{G}^d(w, k)$
is composed entirely of hypertrees and unicyclic
components.

We use a simple consequence:

Corollary 3.1. If $d = O(1)$ and $w \ge 2d(d - 1)k$, then
with probability $1 - O(1/k)$, $\mathbb{G}^d(w, k)$ is composed entirely
of hypertrees and unicyclic components.



We now prove some basic facts about hypertrees and 
unicyclic components: 

Lemma 3.4. Every hypertree has a hyperedge incident
on at least $d - 1$ isolated vertices. Every unicyclic
component either has a hyperedge incident on $d - 1$ isolated
vertices or has a hyperedge incident on $d - 2$ isolated
vertices, the removal of which turns the unicyclic component
into a hypertree.

Proof. Let H be a connected component of s hyperedges 
and r vertices. 

If H is a hypertree, r = (d — l)s + 1. Because H has 
only ds total (hyperedge, incident vertex) pairs, at most 
2(s — 1) of these pairs can involve vertices that appear in 
two or more hyperedges. Thus at least one of the s edges 
is incident on at most one vertex that is not isolated, so 
some edge has d—1 isolated vertices. 

If $H$ is unicyclic, $r = (d - 1)s$ and so at most $2s$ of the
(hyperedge, incident vertex) pairs involve non-isolated
vertices. Therefore on average, each edge has at least $d - 2$
isolated vertices. If no edge is incident on at least $d - 1$
isolated vertices, every edge must be incident on exactly $d - 2$
isolated vertices. In that case, each edge is incident on
exactly two non-isolated vertices and each non-isolated
vertex is in exactly two edges. Hence we can perform an
Eulerian tour of all the edges, so removing any edge does
not disconnect the graph. After removing the edge, the
graph has $s' = s - 1$ edges and $r' = r - d + 2$ vertices;
therefore $r' = (d - 1)s' + 1$ so the graph is a hypertree. □

Corollary 3.1 and Lemma 3.4 combine to show
Lemma 3.2.
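The vertex-counting definitions used above (a component with $r$ vertices and $s$ hyperedges is a hypertree if $r = s(d-1) + 1$ and unicyclic if $r = s(d-1)$) are easy to check on a concrete hypergraph. The following sketch labels each connected component accordingly using a union-find structure; the function name and the label strings are illustrative, and hyperedges are assumed to be given as collections of $d \ge 2$ distinct vertices.

```python
def classify_components(hyperedges, d):
    """Label each connected component of a d-uniform hypergraph by its vertex/edge counts."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Union all vertices of each hyperedge into one component.
    for edge in hyperedges:
        verts = list(edge)
        root = find(verts[0])
        for v in verts[1:]:
            parent[find(v)] = root

    vertex_count, edge_count = {}, {}
    for v in list(parent):
        root = find(v)
        vertex_count[root] = vertex_count.get(root, 0) + 1
    for edge in hyperedges:
        root = find(next(iter(edge)))
        edge_count[root] = edge_count.get(root, 0) + 1

    labels = []
    for root, r in vertex_count.items():
        s = edge_count[root]
        if r == s * (d - 1) + 1:
            labels.append("hypertree")
        elif r == s * (d - 1):
            labels.append("unicyclic")
        else:
            labels.append("other")
    return sorted(labels)


if __name__ == "__main__":
    path = [{0, 1, 2}, {2, 3, 4}]                      # r = 5, s = 2
    cycle = [{10, 11, 12}, {12, 13, 14}, {14, 15, 10}]  # r = 6, s = 3
    print(classify_components(path + cycle, d=3))
```

Sampling many hypergraphs with $w \ge 2d(d-1)k$ and counting how often the label "other" appears gives an empirical view of Corollary 3.1.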



3.4 Total error in terms of point error 
and component size 

Define $C_{i,j}$ to be the event that hyperedges $i$ and $j$ are in
the same component, and $D_i = \sum_j C_{i,j}$ to be the number
of hyperedges in the same component as $i$. Define $L_i$ to
be the cells that are used to estimate $i$; so $L_i = \{q \mid
A_{qi} \ne 0, |P(q)| = 1\}$ at the round of the algorithm when
$i$ is estimated. Define $Y_i = \mathrm{median}_{q \in L_i} A_{qi}(b - A x_S)_q$
to be the "point error" for hyperedge $i$, and $x'$ to be
the output of the algorithm. Then the deviation of the
output at any coordinate $i$ is at most twice the sum of
the point errors in the component containing $i$:

Lemma 3.5.

$|(x' - x_S)_i| \le 2 \sum_{j \in S} |Y_j| C_{i,j}.$

6 Their statement of the theorem is slightly different. This is the
last equation in their proof of the theorem.



Proof. Let $T_i = (x' - x_S)_i$, and define $R_i = \{j \mid j \ne i,
\exists q \in L_i \text{ s.t. } A_{qj} \ne 0\}$ to be the set of hyperedges that
overlap with the cells used to estimate $i$. Then from the
description of the algorithm, it follows that

$T_i = \mathrm{median}_{q \in L_i} A_{qi} \Big( (b - A x_S)_q - \sum_j A_{qj} T_j \Big)$

$|T_i| \le |Y_i| + \sum_{j \in R_i} |T_j|.$

We can think of the $R_i$ as a directed acyclic graph (DAG),
where there is an edge from $j$ to $i$ if $j \in R_i$. Then if $p(i, j)$
is the number of paths from $i$ to $j$,

$|T_i| \le \sum_j p(j, i) |Y_j|.$

Let $r(i) = |\{j \mid i \in R_j\}|$ be the outdegree of the DAG.
Because the $L_i$ are disjoint, $r(i) \le d - |L_i|$. From
Lemma 3.2, $r(i) \le 1$ for all but one hyperedge in the
component, and $r(i) \le 2$ for that one. Hence $p(i, j) \le 2$
for any $i$ and $j$, giving the result. □

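We use the following corollary:

Corollary 3.2.

$\|x' - x_S\|_2^2 \le 4 \sum_{i \in S} D_i^2 Y_i^2$

Proof.

$\|x' - x_S\|_2^2 = \sum_{i \in S} (x' - x_S)_i^2 \le 4 \sum_{i \in S} \Big( \sum_{j \in S,\, C_{i,j} = 1} |Y_j| \Big)^2 \le 4 \sum_{i \in S} D_i \sum_{j \in S,\, C_{i,j} = 1} |Y_j|^2 = 4 \sum_{i \in S} D_i^2 Y_i^2$

where the second inequality is the power means
inequality. □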
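The last equality in the proof of Corollary 3.2 uses two facts: every hyperedge in the component of $j$ has the same component size $D_j$, and that component contains exactly $D_j$ hyperedges. The expansion below spells this out as a reading aid; it is not part of the original proof.

```latex
\begin{align*}
\sum_{i \in S} D_i \sum_{\substack{j \in S \\ C_{i,j} = 1}} |Y_j|^2
  &= \sum_{j \in S} |Y_j|^2 \sum_{\substack{i \in S \\ C_{i,j} = 1}} D_i
     && \text{(swap the order of summation)} \\
  &= \sum_{j \in S} |Y_j|^2 \cdot D_j \cdot D_j
     && \text{($D_i = D_j$ for each of the $D_j$ indices $i$ with $C_{i,j} = 1$)} \\
  &= \sum_{j \in S} D_j^2 \, Y_j^2 .
\end{align*}
```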

The $D_i$ and $Y_i$ are independent from each other, since
one depends only on $A$ over $S$ and one only on $A$ over
$[n] \setminus S$. Therefore we can analyze them separately; the
next two sections show bounds and negative dependence
results for $Y_i$ and $D_i$, respectively.

3.5 Bound on point error 



Recall from Section 3.4 that based entirely on the set $S$
and the columns of $A$ corresponding to $S$, we can identify
the positions $L_i$ used to estimate $x_i$. We then defined the
"point error"

$Y_i = \mathrm{median}_{q \in L_i} A_{qi}(b - A x_S)_q = \mathrm{median}_{q \in L_i} A_{qi}(A(x - x_S) + \nu)_q$

and showed how to relate the total error to the point
error. Here we would like to show that the $Y_i$ have bounded
moments and are negatively dependent. Unfortunately,
it turns out that the $Y_i$ are not negatively associated so
it is unclear how to show negative dependence directly.
Instead, we will define some other variables $Z_i$ that are
always larger than the corresponding $Y_i$. We will then
show that the $Z_i$ have bounded moments and negative
association.

We use the term "NA" throughout the proof to denote
negative association. For the definition of negative
association and relevant properties, see Appendix A.

Lemma 3.6. Suppose $d \ge 7$ and define $\mu =
O(\frac{\epsilon^2}{k}(\|x - x_S\|_2^2 + \|\nu\|_2^2))$. There exist random variables
$Z_i$ such that the variables $Y_i^2$ are stochastically dominated
by $Z_i$, the $Z_i$ are negatively associated, $E[Z_i] = \mu$, and
$E[Z_i^2] = O(\mu^2)$.

Proof. The choice of the $L_i$ depends only on the values of
$A$ over $S$; hence conditioned on knowing $L_i$ we still have
$A(x - x_S)$ distributed randomly over the space.
Furthermore the distribution of $A$ and the reconstruction
algorithm are invariant under permutation, so we can
pretend that $\nu$ is permuted randomly before being added to
$Ax$. Define $B_{i,q}$ to be the event that $q \in \mathrm{supp}(A e_i)$, and
define $D_{i,q} \in \{-1, 1\}$ independently at random. Then
define the random variable

$V_q = (b - A x_S)_q = \nu_q + \sum_{i \in [n] \setminus S} x_i B_{i,q} D_{i,q}.$

Because we want to show concentration of measure, we
would like to show negative association (NA) of the
$Y_i = \mathrm{median}_{q \in L_i} A_{qi} V_q$. We know $\nu$ is a permutation
distribution, so it is NA [JP83]. The $B_{i,q}$ for each $i$ as
a function of $q$ are chosen from a Fermi-Dirac model, so
they are NA [DR96]. The $B_{i,q}$ for different $i$ are
independent, so all the $B_{i,q}$ variables are NA. Unfortunately,
the $D_{i,q}$ can be negative, which means the $V_q$ are not
necessarily NA. Instead we will find some NA variables
that dominate the $V_q$. We do this by considering $V_q$ as a
distribution over $D$.

Let $W_q = E_D[V_q^2] = \nu_q^2 + \sum_{i \in [n] \setminus S} x_i^2 B_{i,q}$. As
increasing functions of NA variables, the $W_q$ are NA. By
Markov's inequality $\Pr_D[V_q^2 \ge c W_q] \le \frac{1}{c}$, so after
choosing the $B_{i,q}$ and as a distribution over $D$, $V_q^2$ is dominated
by the random variable $U_q = W_q F_q$ where $F_q$ is,
independently for each $q$, given by the p.d.f. $f(c) = 1/c^2$ for $c \ge 1$
and $f(c) = 0$ otherwise. Because the distribution of $V_q$
over $D$ is independent for each $q$, the $U_q$ jointly dominate
the $V_q^2$.

The $U_q$ are the componentwise product of the $W_q$ with
independent positive random variables, so they too are
NA. Then define

$Z_i = \mathrm{median}_{q \in L_i} U_q.$

As an increasing function of disjoint subsets of NA
variables, the $Z_i$ are NA. We also have that

$Y_i^2 = (\mathrm{median}_{q \in L_i} A_{qi} V_q)^2 \le (\mathrm{median}_{q \in L_i} |V_q|)^2 = \mathrm{median}_{q \in L_i} V_q^2 \le \mathrm{median}_{q \in L_i} U_q = Z_i$

so the $Z_i$ stochastically dominate $Y_i^2$. We now will bound
$E[Z_i^2]$. Define

$\mu = E[W_q] = E[\nu_q^2] + \sum_{i \in [n] \setminus S} x_i^2 E[B_{i,q}] = \frac{d}{w}\|x - x_S\|_2^2 + \frac{1}{w}\|\nu\|_2^2 \le \frac{\epsilon^2}{k}(\|x - x_S\|_2^2 + \|\nu\|_2^2).$



Then we have

$$\Pr[W_q \ge c\mu] \le \frac{1}{c},$$

$$\Pr[U_q \ge c\mu] = \int_0^\infty f(x) \Pr[W_q \ge c\mu/x]\,dx
\le \int_1^c \frac{1}{x^2}\cdot\frac{x}{c}\,dx + \int_c^\infty \frac{1}{x^2}\,dx
= \frac{1 + \ln c}{c}.$$



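Because the $U_q$ are NA, they satisfy marginal probability
bounds [DR96]:

$$\Pr[U_q \ge t_q,\ q \in [w]] \le \prod_{q \in [w]} \Pr[U_q \ge t_q]$$

for any $t_q$. Therefore

$$\Pr[Z_i \ge c\mu] \le \sum_{\substack{T \subset L_i \\ |T| = |L_i|/2}} \prod_{q \in T} \Pr[U_q \ge c\mu]
\le 2^{|L_i|} \left(\frac{1 + \ln c}{c}\right)^{|L_i|/2},$$

$$\Pr[Z_i \ge c\mu] \le \left(4\,\frac{1 + \ln c}{c}\right)^{d/2 - 1}. \tag{2}$$

If $d \ge 7$, this makes $E[Z_i] = O(\mu)$ and $E[Z_i^2] = O(\mu^2)$. □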
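The integral behind the tail bound on $U_q$ can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names are illustrative, and the midpoint rule is only one convenient way to evaluate the integral.

```python
import math

def tail_bound(c: float) -> float:
    """Closed form (1 + ln c) / c bounding Pr[U_q >= c * mu], for c >= 1."""
    return (1.0 + math.log(c)) / c

def tail_bound_numeric(c: float, steps: int = 100_000) -> float:
    """Midpoint-rule value of int_1^c (1/x^2)(x/c) dx, plus the exact
    tail int_c^inf (1/x^2) dx = 1/c."""
    h = (c - 1.0) / steps
    body = sum(1.0 / ((1.0 + (i + 0.5) * h) * c) for i in range(steps)) * h
    return body + 1.0 / c

def median_tail_bound(c: float, d: int) -> float:
    """Right-hand side of Equation (2): (4 (1 + ln c) / c)^(d/2 - 1)."""
    return (4.0 * tail_bound(c)) ** (d / 2.0 - 1.0)
```

For $d = 7$ the exponent is $5/2$, so the bound decays faster than $c^{-2}$ up to logarithmic factors, which is what makes both $E[Z_i]$ and $E[Z_i^2]$ finite.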



3.6 Bound on component size 

Lemma 3.7. Let $D_i$ be the number of hyperedges in the
same component as hyperedge $i$. Then for any $i \ne j$,

$$\mathrm{Cov}(D_i^2, D_j^2) = E[D_i^2 D_j^2] - E[D_i^2]^2 \le O\!\left(\frac{\log^6 k}{\sqrt{k}}\right).$$

Furthermore, $E[D_i^2] = O(1)$ and $E[D_i^4] = O(1)$.

Proof. The intuition is that if one component gets larger,
other components tend to get smaller. Also the graph
is very sparse, so component size is geometrically
distributed. There is a small probability that $i$ and $j$ are
connected, in which case $D_i$ and $D_j$ are positively
correlated, but otherwise $D_i$ and $D_j$ should be negatively
correlated. However, analyzing this directly is rather
difficult, because as one component gets larger, the remaining
components have a lower average size but higher variance.
Our analysis instead takes a detour through the hypergraph
where each hyperedge is picked independently with
a probability that gives the same expected number of
hyperedges. This distribution is easier to analyze, and only
differs in a relatively small $\tilde{O}(\sqrt{k})$ hyperedges from our
actual distribution. This allows us to move between the
regimes with only a loss of $\tilde{O}(1/\sqrt{k})$, giving our result.

Suppose instead of choosing our hypergraph from
$G^d(w, k)$ we chose it from $G^d(w, k/\binom{w}{d})$; that is, each
hyperedge appeared independently with the appropriate
probability to get $k$ hyperedges in expectation. This model is
somewhat simpler, and yields a very similar hypergraph
$\bar{G}$. One can then modify $\bar{G}$ by adding or removing an
appropriate number of random hyperedges $I$ to get exactly
$k$ hyperedges, forming a uniform $G \in G^d(w, k)$. By
the Chernoff bound, $|I| \le O(\sqrt{k} \log k)$ with probability
$1 - k^{-\Omega(1)}$.

Let $\bar{D}_i$ be the size of the component containing $i$ in $\bar{G}$,
and $H_i = D_i^2 - \bar{D}_i^2$. Let $E$ denote the event that any
of the $D_i$ or $\bar{D}_i$ is more than $C \log k$, or that more than
$C\sqrt{k} \log k$ hyperedges lie in $I$, for some constant $C$. Then
$E$ happens with polynomially small probability in $k$ for a
suitable $C$, so it has negligible influence on $E[D_i^2 D_j^2]$.
Hence the rest of this proof will assume $E$ does not happen.

Therefore $H_i = 0$ if none of the $O(\sqrt{k} \log k)$ random
hyperedges in $I$ touch the $O(\log k)$ hyperedges in the
components containing $i$ in $\bar{G}$, so $H_i = 0$ with
probability at least $1 - O(\frac{\log^2 k}{\sqrt{k}})$. Even if $H_i \ne 0$, we still have
$|H_i| \le (D_i^2 + \bar{D}_i^2) \le O(\log^2 k)$.

Also, we show that the $\bar{D}_i^2$ are negatively correlated,
when conditioned on being in separate components. Let
$D(n, p)$ denote the distribution of the component size of a
random hyperedge on $G^d(n, p)$, where $p$ is the probability
a hyperedge appears. Then $D(n, p)$ dominates $D(n', p)$
whenever $n > n'$, since the latter hypergraph is contained
within the former. If $\bar{C}_{i,j}$ is the event that $i$ and $j$ are
connected in $\bar{G}$, this means

$$E[\bar{D}_i^2 \mid \bar{D}_j = t,\ \bar{C}_{i,j} = 0]$$

is a decreasing function in $t$, so we have negative correlation:

$$E[\bar{D}_i^2 \bar{D}_j^2 \mid \bar{C}_{i,j} = 0] \le E[\bar{D}_i^2 \mid \bar{C}_{i,j} = 0]\, E[\bar{D}_j^2 \mid \bar{C}_{i,j} = 0] \le E[\bar{D}_i^2]\, E[\bar{D}_j^2].$$

Furthermore for $i \ne j$, $\Pr[\bar{C}_{i,j} = 1] = E_i E_{j \ne i} E[\bar{C}_{i,j}] = \frac{E[\bar{D}_i] - 1}{k - 1} = O(1/k)$. Hence

$$\begin{aligned}
E[\bar{D}_i^2 \bar{D}_j^2] &= E[\bar{D}_i^2 \bar{D}_j^2 \mid \bar{C}_{i,j} = 0] \Pr[\bar{C}_{i,j} = 0] + E[\bar{D}_i^2 \bar{D}_j^2 \mid \bar{C}_{i,j} = 1] \Pr[\bar{C}_{i,j} = 1] \\
&\le E[\bar{D}_i^2]\, E[\bar{D}_j^2] + O\!\left(\frac{\log^4 k}{k}\right).
\end{aligned}$$

Therefore

$$\begin{aligned}
E[D_i^2 D_j^2] &= E[(\bar{D}_i^2 + H_i)(\bar{D}_j^2 + H_j)] \\
&= E[\bar{D}_i^2 \bar{D}_j^2] + 2 E[H_i \bar{D}_j^2] + E[H_i H_j] \\
&\le E[\bar{D}_i^2]\, E[\bar{D}_j^2] + O\!\left(\frac{\log^6 k}{\sqrt{k}}\right) \\
&= E[D_i^2 - H_i]^2 + O\!\left(\frac{\log^6 k}{\sqrt{k}}\right) \\
&= E[D_i^2]^2 - 2 E[H_i]\, E[D_i^2] + E[H_i]^2 + O\!\left(\frac{\log^6 k}{\sqrt{k}}\right) \\
&\le E[D_i^2]^2 + O\!\left(\frac{\log^6 k}{\sqrt{k}}\right).
\end{aligned}$$

Here the third line uses that $H_i$ is nonzero with probability
$O(\frac{\log^2 k}{\sqrt{k}})$ and that $|H_i|$ and $\bar{D}_j^2$ are both $O(\log^2 k)$.

Now to bound $E[D_i^4]$ in expectation. Because our
hypergraph is exceedingly sparse, the size of a component
can be bounded by a branching process that
dies out with constant probability at each step. Using
this method, Equations 71 and 72 of [COMS07] state
that $\Pr[\bar{D}_i \ge k] \le e^{-\Omega(k)}$. Hence $E[\bar{D}_i^2] = O(1)$ and
$E[\bar{D}_i^4] = O(1)$. Because $H_i$ is $0$ with high probability and
$O(\log^2 k)$ otherwise, this immediately gives $E[D_i^2] = O(1)$
and $E[D_i^4] = O(1)$. □

3.7 Wrapping it up

Recall from Corollary 3.2 that our total error

$$\|x' - x_S\|_2^2 \le 4 \sum_i Y_i^2 D_i^2 \le 4 \sum_i Z_i D_i^2.$$

The previous sections show that $Z_i$ and $D_i^2$ each have
small expectation and covariance. This allows us to apply
Chebyshev's inequality to concentrate $4 \sum_i Z_i D_i^2$ about
its expectation, bounding $\|x' - x_S\|_2$ with high probability:

Lemma 3.8. We can recover $x'$ from $Ax + \nu$ and $S$ with

$$\|x' - x_S\|_2 \le \epsilon\left(\|x - x_S\|_2 + \|\nu\|_2\right)$$

with probability at least $1 - \frac{1}{c^2 k^{1/3}}$ in $O(k)$ recovery time.
Our $A$ has $O(\frac{c}{\epsilon^2} k)$ rows and sparsity $O(1)$ per column.

Proof. Our total error is

$$\|x' - x_S\|_2^2 \le 4 \sum_i Y_i^2 D_i^2 \le 4 \sum_i Z_i D_i^2.$$

Then by Lemma 3.6 and Lemma 3.7,

$$E\Big[4 \sum_i Z_i D_i^2\Big] = 4 \sum_i E[Z_i]\, E[D_i^2] = k\mu$$

where $\mu = O\big(\frac{\epsilon^2}{k}(\|x - x_S\|_2^2 + \|\nu\|_2^2)\big)$. Furthermore,

$$\begin{aligned}
E\Big[\Big(\sum_i Z_i D_i^2\Big)^2\Big] &= \sum_i E[Z_i^2 D_i^4] + \sum_{i \ne j} E[Z_i Z_j D_i^2 D_j^2] \\
&= \sum_i E[Z_i^2]\, E[D_i^4] + \sum_{i \ne j} E[Z_i Z_j]\, E[D_i^2 D_j^2] \\
&\le \sum_i O(\mu^2) + \sum_{i \ne j} E[Z_i]\, E[Z_j] \left(E[D_i^2]^2 + O\!\left(\frac{\log^6 k}{\sqrt{k}}\right)\right) \\
&= O(\mu^2 k \sqrt{k} \log^6 k) + k(k - 1)\, E[Z_i D_i^2]^2,
\end{aligned}$$

$$\mathrm{Var}\Big(\sum_i Z_i D_i^2\Big) = E\Big[\Big(\sum_i Z_i D_i^2\Big)^2\Big] - k^2 E[Z_i D_i^2]^2 \le O(\mu^2 k \sqrt{k} \log^6 k).$$

By Chebyshev's inequality, this means

$$\Pr\Big[4 \sum_i Z_i D_i^2 \ge (1 + c)\mu k\Big] \le O\!\left(\frac{\log^6 k}{c^2 \sqrt{k}}\right),$$

$$\Pr\Big[\|x' - x_S\|_2^2 \ge (1 + c) C \epsilon^2 \left(\|x - x_S\|_2^2 + \|\nu\|_2^2\right)\Big] \le O\!\left(\frac{1}{c^2 k^{1/3}}\right)$$

for some constant $C$. Rescaling $\epsilon$ down by $\sqrt{C(1 + c)}$,
we can get

$$\|x' - x_S\|_2 \le \epsilon\left(\|x - x_S\|_2 + \|\nu\|_2\right)$$

with probability at least $1 - \frac{1}{c^2 k^{1/3}}$. □
Now we shall go from $k^{-1/3}$ probability of error to $k^{-c}$
error for arbitrary $c$, with $O(c)$ multiplicative cost in time
and space. We simply perform Lemma 3.8 $O(c)$ times in
parallel, and output the pointwise median of the results.
By a standard parallel repetition argument, this gives our
main result:

Theorem 3.1. We can recover $x'$ from $Ax + \nu$ and $S$ with

$$\|x' - x_S\|_2 \le \epsilon\left(\|x - x_S\|_2 + \|\nu\|_2\right)$$

with probability at least $1 - \frac{1}{k^c}$ in $O(ck)$ recovery time.
Our $A$ has $O(\frac{c}{\epsilon^2} k)$ rows and sparsity $O(c)$ per column.



Proof. Lemma 3.8 gives an algorithm that achieves
$O(k^{-1/3})$ probability of error. We will show here how
to achieve $k^{-c}$ probability of error with a linear cost in
$c$, via a standard parallel repetition argument.

Suppose our algorithm gives an $x'$ such that
$\|x' - x_S\|_2 \le \mu$ with probability at least $1 - p$, and that
we run this algorithm $m$ times independently in parallel
to get output vectors $x^1, \ldots, x^m$. We output $y$ given by
$y_i = \mathrm{median}_{j \in [m]} (x^j)_i$, and claim that with high
probability $\|y - x_S\|_2 \le \mu\sqrt{3}$.

Let $J = \{j \in [m] \mid \|x^j - x_S\|_2 \le \mu\}$. Each
$j \in [m]$ lies in $J$ with probability at least $1 - p$, so
the chance that $|J| \le 3m/4$ is less than $\binom{m}{m/4} p^{m/4} \le (4ep)^{m/4}$.
Suppose that $|J| \ge 3m/4$. Then for all $i \in S$,
$|\{j \in J \mid (x^j)_i \le y_i\}| \ge |J| - \frac{m}{2} \ge |J|/3$ and similarly
$|\{j \in J \mid (x^j)_i \ge y_i\}| \ge |J|/3$. Hence for all $i \in S$,
$|y_i - x_i|$ is smaller than at least $|J|/3$ of the $|(x^j)_i - x_i|$
for $j \in J$. Hence

$$|J| \mu^2 \ge \sum_{i \in S} \sum_{j \in J} \left((x^j)_i - x_i\right)^2 \ge \sum_{i \in S} \frac{|J|}{3} (y_i - x_i)^2 = \frac{|J|}{3} \|y - x_S\|_2^2$$

or

$$\|y - x_S\|_2 \le \sqrt{3}\,\mu$$

with probability at least $1 - (4ep)^{m/4}$.

Using Lemma 3.8 to get $p = \frac{1}{16 k^{1/3}}$ and
$\mu = \epsilon(\|x - x_S\|_2 + \|\nu\|_2)$, with $m = 12c$ repetitions we get
Theorem 3.1. □
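The parallel repetition step can be illustrated on a toy instance. This is a minimal sketch under stated assumptions: the vectors below are made-up data (seven runs within $\mu$ of the target, one failed run), and `statistics.median` averages the two middle values when $m$ is even, which still lies between them.

```python
import math
from statistics import median

def pointwise_median(estimates):
    """Coordinate-wise median of m equal-length estimate vectors."""
    return [median(column) for column in zip(*estimates)]

def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical data: the target x_S, an error radius mu, and m = 8 runs.
x_s = [5.0, -2.0, 1.0]
mu = 0.5
good = [[5.1, -2.1, 1.2], [4.9, -1.8, 0.9], [5.2, -2.0, 1.1], [5.0, -2.3, 1.0],
        [4.8, -1.9, 0.8], [5.3, -2.2, 1.0], [5.0, -2.0, 1.3]]
bad = [[100.0, 100.0, -100.0]]  # one run that failed badly

y = pointwise_median(good + bad)
num_good = sum(l2_distance(v, x_s) <= mu for v in good + bad)
```

Here $|J| = 7 \ge 3m/4 = 6$, so the argument above applies and the median output stays within $\sqrt{3}\,\mu$ of $x_S$ despite the failed run.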



4 Applications 

We give two applications where the set query algorithm 
is a useful primitive. 

4.1 Heavy Hitters of sub-Zipfian distributions

For a vector $x$, let $r_i$ be the index of the $i$th largest
element, so $|x_{r_i}|$ is non-increasing in $i$. We say that $x$ is
Zipfian with parameter $\alpha$ if $|x_{r_i}| = \Theta(|x_{r_1}|\, i^{-\alpha})$. We say
that $x$ is sub-Zipfian with parameters $(k, \alpha)$ if there
exists a non-increasing function $f$ with $|x_{r_i}| = \Theta(f(i)\, i^{-\alpha})$
for all $i \ge k$. A Zipfian with parameter $\alpha$ is a sub-Zipfian
with parameter $(k, \alpha)$ for all $k$, using $f(i) = |x_{r_1}|$.

The Zipfian heavy hitters problem is, given a linear
sketch $Ax$ of a Zipfian $x$ and a parameter $k$, to find a
$k$-sparse $x'$ with minimal $\|x - x'\|_2$ (up to some
approximation factor). We require that $x'$ be $k$-sparse (and no
more) because we want to find the heavy hitters
themselves, not to find them as a proxy for approximating
$x$.



Zipfian distributions are common in real-world data 
sets, and finding heavy hitters is one of the most impor- 
tant problems in data streams. Therefore this is a very 
natural problem to try to improve; indeed, the original 
paper on Count-Sketch discussed it [CCF02]. They show
a result complementary to our work, namely that one can 
find the support efficiently: 



Lemma 4.1 (Section 4.1 of [CCF02]). If $x$ is sub-Zipfian
with parameter $(k, \alpha)$ and $\alpha > 1/2$, one can recover a
candidate support set $S$ with $|S| = O(k)$ from $Ax$ such
that $\{r_1, \ldots, r_k\} \subseteq S$. $A$ has $O(k \log n)$ rows and recovery
succeeds with high probability in $n$.

Proof sketch. Let $S_k = \{r_1, \ldots, r_k\}$. With $O(\frac{1}{\epsilon^2} k \log n)$
measurements, Count-Sketch identifies each $x_i$ to within
$\frac{\epsilon}{\sqrt{k}} \|x - x_{S_k}\|_2$ with high probability. If $\alpha > 1/2$, this is
less than $|x_{r_k}|/3$ for appropriate $\epsilon$. But $|x_{r_{9k}}| \le |x_{r_k}|/3$.
Hence only the largest $9k$ elements of $x$ could be
estimated as larger than anything in $x_{S_k}$, so the locations of
the largest $9k$ estimated values must contain $S_k$. □



It is observed in [CCF02] that a two-pass algorithm
could identify the heavy hitters exactly. However, with a
single pass, no better method has been known for Zipfian
distributions than for arbitrary distributions; in fact, the
lower bound [DIPW10] on linear sparse recovery uses a
geometric (and hence sub-Zipfian) distribution.

As discussed in [CCF02], using Count-Sketch⁷ with
$O(\frac{k}{\epsilon^2} \log n)$ rows gets a $k$-sparse $x'$ with

$$\|x' - x\|_2 \le (1 + \epsilon)\,\mathrm{Err}_2(x, k) = \Theta\!\left(\frac{|x_{r_1}|}{\sqrt{2\alpha - 1}}\, k^{1/2 - \alpha}\right),$$

where, as in Section 1,

$$\mathrm{Err}_2(x, k) = \min_{k\text{-sparse } \hat{x}} \|x - \hat{x}\|_2.$$
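As an illustration of the quantities above, the following minimal sketch computes $\mathrm{Err}_2(x, k)$ directly and compares it with the closed form $|x_{r_1}|\, k^{1/2 - \alpha} / \sqrt{2\alpha - 1}$ on an exactly Zipfian vector. The helper names and the test vector are assumptions made for the example, not part of the paper.

```python
import math

def err2(x, k):
    """Err_2(x, k): l2 norm of x after removing its k largest-magnitude entries."""
    tail = sorted((abs(v) for v in x), reverse=True)[k:]
    return math.sqrt(sum(v * v for v in tail))

def zipfian(n, alpha, scale=1.0):
    """Hypothetical exact Zipfian vector with |x_{r_i}| = scale * i^(-alpha)."""
    return [scale * i ** (-alpha) for i in range(1, n + 1)]

def zipfian_err2_estimate(k, alpha, scale=1.0):
    """Closed-form approximation scale * k^(1/2 - alpha) / sqrt(2 alpha - 1)."""
    return scale * k ** (0.5 - alpha) / math.sqrt(2.0 * alpha - 1.0)

x = zipfian(20_000, alpha=1.0)
ratio = err2(x, 100) / zipfian_err2_estimate(100, alpha=1.0)
```

The ratio is close to 1 because the tail sum $\sum_{i > k} i^{-2\alpha}$ is well approximated by the integral $k^{1 - 2\alpha} / (2\alpha - 1)$ when $n \gg k$.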

The set query algorithm lets us improve from a $1 + \epsilon$
approximation to a $1 + o(1)$ approximation. This is not
useful for approximating $x$, since increasing $k$ is much
more effective than decreasing $\epsilon$. Instead, it is useful for
finding $k$ elements that are quite close to being the actual
$k$ heavy hitters of $x$.

Naive application of the set query algorithm to the output
set of Lemma 4.1 would only get a close $O(k)$-sparse
vector, not a $k$-sparse vector. To get a $k$-sparse vector,
we must show a lemma that generalizes one used in the
proof of sparse recovery of Count-Sketch (first in [CM06],
but our description is more similar to [GI10]).

7 Another analysis ([CM05]) uses Count-Min to achieve a better
polynomial dependence on $\epsilon$, but at the cost of using the $\ell_1$ norm.
Our result is an improvement over this as well.

Lemma 4.2. Let $x, x' \in \mathbb{R}^n$. Let $S$ and $S'$ be the
locations of the largest $k$ elements (in magnitude) of $x$ and
$x'$, respectively. Then if

$$\|(x' - x)_{S \cup S'}\|_2 \le \epsilon\, \mathrm{Err}_2(x, k), \tag{$*$}$$

for $\epsilon \le 1$, we have

$$\|x'_{S'} - x\|_2 \le (1 + 3\epsilon)\, \mathrm{Err}_2(x, k).$$

Previous proofs have shown the following weaker form:

Corollary 4.1. If we change the condition ($*$) to
$\|x' - x\|_\infty \le \frac{\epsilon}{\sqrt{2k}} \mathrm{Err}_2(x, k)$, the same result holds.

The corollary is immediate from Lemma 4.2 and

$$\|(x' - x)_{S \cup S'}\|_2 \le \sqrt{|S \cup S'|}\; \|(x' - x)_{S \cup S'}\|_\infty.$$

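Proof of Lemma 4.2. We have

$$\|x'_{S'} - x\|_2^2 = \|(x' - x)_{S'}\|_2^2 + \|x_{S \setminus S'}\|_2^2 + \|x_{[n] \setminus (S \cup S')}\|_2^2. \tag{3}$$

The tricky bit is to bound the middle term $\|x_{S \setminus S'}\|_2^2$. We
will show that it is not much larger than $\|x_{S' \setminus S}\|_2^2$.

Let $d = |S \setminus S'|$, and let $a$ be the $d$-dimensional vector
corresponding to the absolute values of the coefficients
of $x$ over $S \setminus S'$. That is, if $S \setminus S' = \{j_1, \ldots, j_d\}$, then
$a_i = |x_{j_i}|$ for $i \in [d]$. Let $a'$ be analogous for $x'$ over
$S \setminus S'$, and let $b$ and $b'$ be analogous for $x$ and $x'$ over
$S' \setminus S$, respectively.

Let $E = \mathrm{Err}_2(x, k) = \|x - x_S\|_2$. We have

$$\begin{aligned}
\|x_{S \setminus S'}\|_2^2 - \|x_{S' \setminus S}\|_2^2 &= \|a\|_2^2 - \|b\|_2^2 \\
&= (a - b) \cdot (a + b) \\
&\le \|a - b\|_2 \|a + b\|_2 \\
&\le \|a - b\|_2 \left(2\|b\|_2 + \|a - b\|_2\right) \\
&\le \|a - b\|_2 \left(2E + \|a - b\|_2\right).
\end{aligned}$$

So we should bound $\|a - b\|_2$. We know that $||p| - |q|| \le |p - q|$
for all $p$ and $q$, so

$$\|a - a'\|_2^2 + \|b - b'\|_2^2 \le \|(x - x')_{S \setminus S'}\|_2^2 + \|(x - x')_{S' \setminus S}\|_2^2 \le \|(x - x')_{S \cup S'}\|_2^2 \le \epsilon^2 E^2.$$

We also know that $a - b$ and $b' - a'$ both contain all
nonnegative coefficients. Hence

$$\begin{aligned}
\|a - b\|_2^2 &\le \|a - b + b' - a'\|_2^2 \\
&\le \left(\|a - a'\|_2 + \|b' - b\|_2\right)^2 \\
&\le 2\|a - a'\|_2^2 + 2\|b - b'\|_2^2 \\
&\le 2\epsilon^2 E^2,
\end{aligned}$$

so $\|a - b\|_2 \le \sqrt{2}\,\epsilon E$. Therefore

$$\begin{aligned}
\|x_{S \setminus S'}\|_2^2 - \|x_{S' \setminus S}\|_2^2 &\le \sqrt{2}\,\epsilon E \left(2E + \sqrt{2}\,\epsilon E\right) \\
&\le (2\sqrt{2} + 2)\,\epsilon E^2 \\
&\le 5 \epsilon E^2.
\end{aligned}$$

Plugging into Equation (3) and using $\|(x' - x)_{S'}\|_2^2 \le \epsilon^2 E^2$,

$$\begin{aligned}
\|x'_{S'} - x\|_2^2 &\le \epsilon^2 E^2 + 5\epsilon E^2 + \|x_{S' \setminus S}\|_2^2 + \|x_{[n] \setminus (S \cup S')}\|_2^2 \\
&\le 6\epsilon E^2 + \|x_{[n] \setminus S}\|_2^2 \\
&= (1 + 6\epsilon) E^2,
\end{aligned}$$

so $\|x'_{S'} - x\|_2 \le (1 + 3\epsilon) E$. □

With this lemma in hand, on Zipfian distributions we
can get a $k$-sparse $x'$ with a $1 + o(1)$ approximation factor.

Theorem 4.1. Suppose $x$ comes from a sub-Zipfian
distribution with parameter $\alpha > 1/2$. Then we can recover
a $k$-sparse $x'$ from $Ax$ with

$$\|x' - x\|_2 \le \left(1 + \frac{\epsilon}{\sqrt{\log n}}\right) \mathrm{Err}_2(x, k)$$

with $O(\frac{c}{\epsilon^2} k \log n)$ rows and $O(n \log n)$ recovery time, with
probability at least $1 - \frac{1}{k^c}$.

Proof. By Lemma 4.1 we can identify a set $S$ of size $O(k)$
that contains all the heavy hitters. We then run the set
query algorithm of Theorem 3.1 with $\frac{\epsilon}{3\sqrt{\log n}}$ substituted
for $\epsilon$. This gives an $\hat{x}_S$ with

$$\|\hat{x}_S - x_S\|_2 \le \frac{\epsilon}{3\sqrt{\log n}}\, \mathrm{Err}_2(x, k).$$

Let $x'$ contain the largest $k$ coefficients of $\hat{x}_S$. By
Lemma 4.2 we have

$$\|x' - x\|_2 \le \left(1 + \frac{\epsilon}{\sqrt{\log n}}\right) \mathrm{Err}_2(x, k).$$

□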
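Lemma 4.2 can be sanity-checked on a small instance where the estimate $x'$ swaps one heavy hitter for a near-tie. The sketch below is illustrative only: the vectors are made-up data, and ties in magnitude are broken by index, which is an assumption the lemma itself does not need.

```python
import math

def top_k_support(v, k):
    """Indices of the k largest-magnitude entries of v (ties broken by index)."""
    return set(sorted(range(len(v)), key=lambda i: (-abs(v[i]), i))[:k])

def l2(values):
    """Euclidean norm of an iterable of numbers."""
    return math.sqrt(sum(t * t for t in values))

def lemma_4_2_quantities(x, x_prime, k):
    """Return (Err_2(x, k), left side of condition (*), ||x'_{S'} - x||_2)."""
    s, s_prime = top_k_support(x, k), top_k_support(x_prime, k)
    err = l2(x[i] for i in range(len(x)) if i not in s)
    cond = l2(x_prime[i] - x[i] for i in s | s_prime)
    out = l2((x_prime[i] if i in s_prime else 0.0) - x[i] for i in range(len(x)))
    return err, cond, out

# Hypothetical data: x' ranks index 2 above index 1, so S' = {0, 2} but S = {0, 1}.
x = [10.0, 5.0, 4.9, 1.0, 1.0]
x_prime = [10.1, 4.9, 5.0, 0.0, 0.0]
err, cond, out = lemma_4_2_quantities(x, x_prime, k=2)
eps = cond / err  # smallest epsilon for which condition (*) holds
```

Even though $S' \ne S$ here, the truncated estimate $x'_{S'}$ stays within the $(1 + 3\epsilon)\,\mathrm{Err}_2(x, k)$ bound, as the lemma predicts.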



4.2 Block-sparse vectors 

In this section we consider the problem of finding
block-sparse approximations. In this case, the coordinate set
$\{1, \ldots, n\}$ is partitioned into $n/b$ blocks, each of length
$b$. We define a $(k, b)$-block-sparse vector to be a vector
where all non-zero elements are contained in at most $k/b$
blocks. That is, we partition $\{1, \ldots, n\}$ into
$T_i = \{(i - 1)b + 1, \ldots, ib\}$. A vector $x$ is $(k, b)$-block-sparse if there
exist $S_1, \ldots, S_{k/b} \in \{T_1, \ldots, T_{n/b}\}$ with $\mathrm{supp}(x) \subseteq \bigcup_i S_i$.
Define

$$\mathrm{Err}_2(x, k, b) = \min_{(k, b)\text{-block-sparse } \hat{x}} \|x - \hat{x}\|_2.$$
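The block-sparse error can be computed directly, since the minimum is attained by keeping the $k/b$ blocks of largest $\ell_2$ mass and zeroing the rest. A minimal sketch, assuming $b$ divides both $n$ and $k$ (the function name is illustrative):

```python
import math

def block_err2(x, k, b):
    """Err_2(x, k, b): l2 norm of x after removing the k/b blocks of largest l2 mass."""
    if len(x) % b != 0 or k % b != 0:
        raise ValueError("b must divide both len(x) and k")
    # Squared l2 mass of each length-b block T_1, ..., T_{n/b}, largest first.
    masses = sorted((sum(v * v for v in x[i:i + b]) for i in range(0, len(x), b)),
                    reverse=True)
    return math.sqrt(sum(masses[k // b:]))
```

For example, with blocks of length 2 and squared masses 25, 1, 4, 0, keeping one block leaves an error of $\sqrt{5}$ and keeping two leaves an error of $1$.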
Finding the support of block-sparse vectors is closely
related to finding block heavy hitters, which is studied for
the $\ell_1$ norm in [ABI08]. The idea is to perform
dimensionality reduction of each block into $\log n$ dimensions,
then perform sparse recovery on the resulting $\frac{k \log n}{b}$-sparse
vector. The differences from previous work are
minor, so we relegate the details to Appendix C.

Lemma 4.3. For any $b$ and $k$, there exists a family of
matrices $A$ with $O(\frac{k}{\epsilon^5 b} \log n)$ rows and column sparsity
$O(\frac{1}{\epsilon^2} \log n)$ such that we can recover a support $S$ from
$Ax$ in $O(\frac{n}{\epsilon^2 b} \log n)$ time with

$$\|x - x_S\|_2 \le (1 + \epsilon)\, \mathrm{Err}_2(x, k, b)$$

with probability at least $1 - n^{-\Omega(1)}$.

Once we know a good support $S$, we can run the set query
algorithm (Algorithm 3.1) to estimate $x_S$.



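Theorem 4.2. For any $b$ and $k$, there exists a family of
binary matrices $A$ with $O(\frac{1}{\epsilon^2} k + \frac{k}{\epsilon^5 b} \log n)$ rows such that
we can recover a $(k, b)$-block-sparse $x'$ in $O(k + \frac{n}{\epsilon^2 b} \log n)$
time with

$$\|x' - x\|_2 \le (1 + \epsilon)\, \mathrm{Err}_2(x, k, b)$$

with probability at least $1 - k^{-\Omega(1)}$.

Proof. Let $S$ be the result of Lemma 4.3 with
approximation $\epsilon/3$, so

$$\|x - x_S\|_2 \le \left(1 + \frac{\epsilon}{3}\right) \mathrm{Err}_2(x, k, b).$$

Then the set query algorithm on $x$ and $S$ uses $O(k/\epsilon^2)$
rows to return an $x'$ with

$$\|x' - x_S\|_2 \le \frac{\epsilon}{3} \|x - x_S\|_2.$$

Therefore

$$\begin{aligned}
\|x' - x\|_2 &\le \|x' - x_S\|_2 + \|x - x_S\|_2 \\
&\le \left(1 + \frac{\epsilon}{3}\right) \|x - x_S\|_2 \\
&\le \left(1 + \frac{\epsilon}{3}\right)^2 \mathrm{Err}_2(x, k, b) \\
&\le (1 + \epsilon)\, \mathrm{Err}_2(x, k, b)
\end{aligned}$$

as desired. □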



If the block size $b$ is at least $\log n$ and $\epsilon$ is constant,
this gives an optimal bound of $O(k)$ rows.

5 Conclusion and Future Work 

We show efficient recovery of vectors conforming to Zipfian
or block sparse models, but leave open extending this
to other models. Our framework decomposes the task
into first locating the heavy hitters and then estimating
them, and our set query algorithm is an efficient general
solution for estimating the heavy hitters once found.
The remaining task is to efficiently locate heavy hitters
in other models.

Our analysis assumes that the columns of A are fully 
independent. It would be valuable to reduce the inde- 
pendence needed, and hence the space required to store 
A. 

We show $k$-sparse recovery of Zipfian distributions with
$1 + o(1)$ approximation in $O(k \log n)$ space. Can the $o(1)$
be made smaller, or a lower bound shown, for this problem?

A Negative Dependence

Negative dependence is a fairly common property in
balls-and-bins types of problems, and can often cleanly
be analyzed using the framework of negative association
[DR96, DPR96, JP83].

Definition 1 (Negative Association). Let $(X_1, \ldots, X_n)$
be a vector of random variables. Then $(X_1, \ldots, X_n)$ are
negatively associated if for every two disjoint index sets,
$I, J \subseteq [n]$,

$$E[f(X_i, i \in I)\, g(X_j, j \in J)] \le E[f(X_i, i \in I)]\, E[g(X_j, j \in J)]$$

for all functions $f : \mathbb{R}^{|I|} \to \mathbb{R}$ and $g : \mathbb{R}^{|J|} \to \mathbb{R}$ that are
both non-decreasing or both non-increasing.

If random variables are negatively associated then one
can apply most standard concentration of measure
arguments, such as Chebyshev's inequality and the Chernoff
bound. This means it is a fairly strong property, which
makes it hard to prove directly. What makes it so useful
is that it remains true under two composition rules:

Lemma A.1 ([DR96], Proposition 7).

1. If $(X_1, \ldots, X_n)$ and $(Y_1, \ldots, Y_m)$ are each
negatively associated and mutually independent, then
$(X_1, \ldots, X_n, Y_1, \ldots, Y_m)$ is negatively associated.

2. Suppose $(X_1, \ldots, X_n)$ is negatively associated. Let
$I_1, \ldots, I_k \subseteq [n]$ be disjoint index sets, for some
positive integer $k$. For $j \in [k]$, let $h_j : \mathbb{R}^{|I_j|} \to \mathbb{R}$
be functions that are all non-decreasing or all
non-increasing, and define $Y_j = h_j(X_i, i \in I_j)$. Then
$(Y_1, \ldots, Y_k)$ is also negatively associated.
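A permutation distribution is a standard example of negative association. As a small illustration, the sketch below computes the exact covariance between two coordinates of a uniformly random permutation of a fixed list, which is the definition above with singleton index sets and $f$, $g$ the identity. The function name is hypothetical, and exhaustive enumeration is only practical for very short lists.

```python
from fractions import Fraction
from itertools import permutations

def covariance_of_permutation(values, i, j):
    """Exact Cov(X_i, X_j) when X is a uniformly random permutation of `values`."""
    perms = list(permutations(values))
    n = Fraction(len(perms))
    e_i = sum(Fraction(p[i]) for p in perms) / n
    e_j = sum(Fraction(p[j]) for p in perms) / n
    e_ij = sum(Fraction(p[i]) * Fraction(p[j]) for p in perms) / n
    return e_ij - e_i * e_j
```

For one ball thrown into three bins, `covariance_of_permutation((1, 0, 0), 0, 1)` is $-1/9$: the two indicators can never both be 1, so their product has expectation 0 while each has expectation $1/3$.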

Lemma A.1 allows us to relatively easily show that one
component of our error (the point error) is negatively
associated without performing any computation. Unfortunately,
the other component of our error (the component
size) is not easily built up by repeated applications of
Lemma A.1.⁸ Therefore we show something much weaker
for this error, namely approximate negative correlation:

$$E[X_i X_j] - E[X_i]\, E[X_j] \le \frac{1}{k^{\Omega(1)}}\, E[X_i]\, E[X_j]$$

for all $i \ne j$. This is still strong enough to use
Chebyshev's inequality.

8 This paper considers the component size of each hyperedge,
which clearly is not negatively associated: if one hyperedge is in a
component of size $k$ then so is every other hyperedge in that
component. But one can consider variants that just consider the
distribution of component sizes, which seems plausibly negatively
associated. However, this is hard to prove.

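B Set Query in the $\ell_1$ norm

This section works through all the changes to prove the
set query algorithm works in the $\ell_1$ norm with $w = O(\frac{1}{\epsilon} k)$
measurements.

We use Lemma 3.5 to get an $\ell_1$ analog of Corollary 3.2:

$$\|x' - x_S\|_1 = \sum_{i \in S} |(x' - x_S)_i| \le \sum_{i \in S} 2 \sum_{j \text{ connected to } i} |Y_j| = 2 \sum_{i \in S} D_i |Y_i|. \tag{4}$$

Then we bound the expectation, variance, and covariance
of $D_i$ and $|Y_i|$. The bound on $D_i$ works the
same as in Section 3.6: $E[D_i] = O(1)$, $E[D_i^2] = O(1)$, and
$E[D_i D_j] - E[D_i]^2 \le O(\log^4 k / \sqrt{k})$.

The bound on $|Y_i|$ is slightly different. We define

$$U'_q = |\nu_q| + \sum_{i \in [n] \setminus S} |x_i| B_{i,q}$$

and observe that $U'_q \ge |V_q|$, and $U'_q$ is NA. Hence

$$Z'_i = \mathrm{median}_{q \in L_i} U'_q$$

is NA, and $|Y_i| \le Z'_i$. Define

$$\mu' = E[U'_q] = \frac{d}{w}\|x - x_S\|_1 + \frac{1}{w}\|\nu\|_1 \le \frac{\epsilon}{k}\left(\|x - x_S\|_1 + \|\nu\|_1\right),$$

then

$$\Pr[Z'_i \ge c\mu'] \le 2^{|L_i|} \left(\frac{1}{c}\right)^{|L_i|/2} \le \left(\frac{4}{c}\right)^{d/2 - 1},$$

so $E[Z'_i] = O(\mu')$ and $E[Z_i'^2] = O(\mu'^2)$.

Now we will show the analog of Section 3.7. We know

$$\|x' - x_S\|_1 \le 2 \sum_i D_i Z'_i$$

and

$$E\Big[2 \sum_i D_i Z'_i\Big] = 2 \sum_i E[D_i]\, E[Z'_i] = k\mu'$$

for some $\mu' = O\big(\frac{\epsilon}{k}(\|x - x_S\|_1 + \|\nu\|_1)\big)$. Then

$$\begin{aligned}
E\Big[\Big(\sum_i D_i Z'_i\Big)^2\Big] &= \sum_i E[D_i^2]\, E[Z_i'^2] + \sum_{i \ne j} E[D_i D_j]\, E[Z'_i Z'_j] \\
&\le \sum_i O(\mu'^2) + \sum_{i \ne j} \left(E[D_i]^2 + O(\log^4 k / \sqrt{k})\right) E[Z'_i]^2 \\
&= O(\mu'^2 k \sqrt{k} \log^4 k) + k(k - 1)\, E[D_i Z'_i]^2,
\end{aligned}$$

$$\mathrm{Var}\Big(2 \sum_i Z'_i D_i\Big) \le O(\mu'^2 k \sqrt{k} \log^4 k).$$

By Chebyshev's inequality, we get

$$\Pr\big[\|x' - x_S\|_1 \ge (1 + \alpha) k \mu'\big] \le O\!\left(\frac{\log^4 k}{\alpha^2 \sqrt{k}}\right)$$

and the main theorem (for constant $c = 1/3$) follows.
The parallel repetition method of Section 3.7 works the
same as in the $\ell_2$ case to support arbitrary $c$.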



C Block Heavy Hitters 



Lemma 4.3. For any $b$ and $k$, there exists a family of
matrices $A$ with $O(\frac{k}{\epsilon^5 b} \log n)$ rows and column sparsity
$O(\frac{1}{\epsilon^2} \log n)$ such that we can recover a support $S$ from
$Ax$ in $O(\frac{n}{\epsilon^2 b} \log n)$ time with

$$\|x - x_S\|_2 \le (1 + \epsilon)\, \mathrm{Err}_2(x, k, b)$$

with probability at least $1 - n^{-\Omega(1)}$.



Proof. This proof follows the method of [ABI08], but
applies to the $\ell_2$ norm and is in the (slightly stronger) sparse
recovery framework rather than the heavy hitters
framework. The idea is to perform dimensionality reduction,
then use an argument similar to those for Count-Sketch
(first in [CM06], but we follow more closely the
description in [GI10]).

Define $s = k/b$ and $t = n/b$, and decompose $[n]$ into
equal sized blocks $T_1, \ldots, T_t$. Let $x_{(T_i)} \in \mathbb{R}^b$ denote the
restriction of $x_{T_i}$ to the coordinates $T_i$. Let $U \subseteq [t]$
have $|U| = s$ and contain the $s$ largest blocks in $x$, so
$\mathrm{Err}_2(x, k, b) = \|\sum_{i \notin U} x_{T_i}\|_2$.

Choose an i.i.d. standard Gaussian matrix $\rho \in \mathbb{R}^{m \times b}$
for $m = O(\frac{1}{\epsilon^2} \log n)$. Define $y_{q,i} = (\rho x_{(T_q)})_i$, so as a
distribution over $\rho$, $y_{q,i}$ is a Gaussian with variance $\|x_{(T_q)}\|_2^2$.

Let $h_1, \ldots, h_m : [t] \to [l]$ be pairwise independent hash
functions for some $l = O(\frac{1}{\epsilon^3} s)$, and $g_1, \ldots, g_m : [t] \to \{-1, 1\}$
also be pairwise independent. Then we make
$m$ hash tables $H^{(1)}, \ldots, H^{(m)}$ of size $l$ each, and say that
the value of the $j$th cell in the $i$th hash table $H^{(i)}$ is given
by

$$H^{(i)}_j = \sum_{q \,:\, h_i(q) = j} g_i(q)\, y_{q,i}.$$

Then the $H^{(i)}_j$ form a linear sketch of $ml = O(\frac{k}{\epsilon^5 b} \log n)$
cells. We use this sketch to estimate the mass of each
block, and output the blocks that we estimate to have
the highest mass. Our estimator for $\|x_{T_i}\|_2$ is

$$z'_i = \alpha\, \mathrm{median}_{j \in [m]} \left|H^{(j)}_{h_j(i)}\right|$$

for some constant scaling factor $\alpha \approx 1.48$. Since we only
care which blocks have the largest magnitude, we don't
actually need to use $\alpha$.
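The construction above can be sketched in a few lines. This is an illustrative toy version under explicit assumptions: the hash functions $h_i$ and signs $g_i$ are drawn fully at random from a seeded generator rather than from a pairwise independent family, the sizes `m` and `l` are arbitrary small values rather than the asymptotic choices, and sketching and estimation are done in one function for brevity (only the table `H` and the hashes are used when forming the estimates).

```python
import random
from statistics import median

ALPHA = 1.4826  # approximately 1 / (median of |N(0, 1)|)

def block_mass_estimates(x, b, m, l, seed=0):
    """Estimate the l2 mass of each length-b block of x from m hash tables of size l."""
    rng = random.Random(seed)
    t = len(x) // b
    blocks = [x[q * b:(q + 1) * b] for q in range(t)]
    rho = [[rng.gauss(0.0, 1.0) for _ in range(b)] for _ in range(m)]  # m x b Gaussian
    h = [[rng.randrange(l) for _ in range(t)] for _ in range(m)]       # h_i : [t] -> [l]
    g = [[rng.choice((-1, 1)) for _ in range(t)] for _ in range(m)]    # g_i : [t] -> {-1, 1}
    # Linear sketch: H[i][j] = sum over blocks q with h_i(q) = j of g_i(q) * y_{q,i}.
    H = [[0.0] * l for _ in range(m)]
    for i in range(m):
        for q in range(t):
            y = sum(r * v for r, v in zip(rho[i], blocks[q]))  # y_{q,i} = (rho x_(T_q))_i
            H[i][h[i][q]] += g[i][q] * y
    # Estimator: scaled median over tables of the magnitude of the cell block q hashes to.
    return [ALPHA * median(abs(H[i][h[i][q]]) for i in range(m)) for q in range(t)]

# Hypothetical input: 16 blocks of length 4, with all mass in block 5.
x = [0.0] * 64
x[20:24] = [3.0, 4.0, 0.0, 0.0]
est = block_mass_estimates(x, b=4, m=31, l=32)
heaviest = max(range(len(est)), key=lambda q: est[q])
```

With a single heavy block, every other block's cell is exactly zero unless it collides with the heavy block, so its median is zero unless more than half of the tables collide; the heavy block is therefore selected except with negligible probability.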



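We first claim that for each $i$ and $j$ with probability
$1 - O(\epsilon)$, $(H^{(j)}_{h_j(i)} - y_{i,j})^2 \le O(\frac{\epsilon^2}{s} (\mathrm{Err}_2(x, k, b))^2)$. To prove
it, note that the probability any $q \in U$ with $q \ne i$ having
$h_j(q) = h_j(i)$ is at most $\frac{s}{l} \le \epsilon^3$. If such a collision with
a heavy hitter does not happen, then

$$\begin{aligned}
E\left[\left(H^{(j)}_{h_j(i)} - y_{i,j}\right)^2\right] &= E\Big[\sum_{p \ne i,\ h_j(p) = h_j(i)} y_{p,j}^2\Big] \\
&\le \sum_{p \notin U} \frac{1}{l} E[y_{p,j}^2] \\
&= \frac{1}{l} \sum_{p \notin U} \|x_{T_p}\|_2^2 \\
&= \frac{1}{l} (\mathrm{Err}_2(x, k, b))^2.
\end{aligned}$$

By Markov's inequality and the union bound, we have

$$\Pr\left[\left(H^{(j)}_{h_j(i)} - y_{i,j}\right)^2 \ge \frac{\epsilon^2}{s} (\mathrm{Err}_2(x, k, b))^2\right] \le \epsilon + \epsilon^3 = O(\epsilon).$$

Let $B_{i,j}$ be the event that
$(H^{(j)}_{h_j(i)} - y_{i,j})^2 > O(\frac{\epsilon^2}{s} (\mathrm{Err}_2(x, k, b))^2)$, so $\Pr[B_{i,j}] = O(\epsilon)$. This is
independent for each $j$, so by the Chernoff bound $\sum_{j=1}^m B_{i,j} \le O(\epsilon m)$
with high probability in $n$.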

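Now, $|y_{i,j}|$ is distributed according to the positive half
of a Gaussian, so there is some constant $\alpha \approx 1.48$ such
that $\alpha |y_{i,j}|$ is an unbiased estimator for $\|x_{T_i}\|_2$. For any
$C \ge 1$ and some $\delta = O(C\epsilon)$, we expect less than $\frac{1 - C\epsilon}{2} m$
of the $\alpha |y_{i,j}|$ to be below $(1 - \delta) \|x_{T_i}\|_2$, less than $\frac{1 - C\epsilon}{2} m$
to be above $(1 + \delta) \|x_{T_i}\|_2$, and more than $C\epsilon m$ to be in
between. Because $m \ge \Omega(\frac{1}{\epsilon^2} \log n)$, the Chernoff bound
shows that with high probability the actual number of
$\alpha |y_{i,j}|$ in each interval is within $\frac{\epsilon}{2} m = O(\frac{1}{\epsilon} \log n)$ of its
expectation. Hence

$$\left| \|x_{T_i}\|_2 - \alpha\, \mathrm{median}_{j \in [m]} |y_{i,j}| \right| \le \delta \|x_{T_i}\|_2 = O(C\epsilon) \|x_{T_i}\|_2$$

even if $\frac{(C - 1)\epsilon}{2} m$ of the $y_{i,j}$ were adversarially modified.
We can think of the events $B_{i,j}$ as being such adversarial
modifications. We find that

$$\left| \|x_{T_i}\|_2 - z_i \right| = \left| \|x_{T_i}\|_2 - \alpha\, \mathrm{median}_{j \in [m]} \left|H^{(j)}_{h_j(i)}\right| \right|
\le O(\epsilon) \|x_{T_i}\|_2 + O\!\left(\frac{\epsilon}{\sqrt{s}}\, \mathrm{Err}_2(x, k, b)\right),$$

$$\left(\|x_{T_i}\|_2 - z_i\right)^2 \le O\!\left(\epsilon^2 \|x_{T_i}\|_2^2 + \frac{\epsilon^2}{s} (\mathrm{Err}_2(x, k, b))^2\right).$$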

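Define $w_i = \|x_{T_i}\|_2$, $\mu = \mathrm{Err}_2(x, k, b)$, and $\hat{U} \subseteq [t]$ to
contain the $s$ largest coordinates in $z$. Since $z$ is computed
from the sketch, the recovery algorithm can compute
$\hat{U}$. The output of our algorithm will be the blocks
corresponding to $\hat{U}$.

We know $\mu^2 = \sum_{i \notin U} w_i^2 = \|w_{[t] \setminus U}\|_2^2$ and
$|w_i - z_i| \le O(\epsilon w_i + \frac{\epsilon}{\sqrt{s}} \mu)$ for all $i$. We will show that

$$\|w_{[t] \setminus \hat{U}}\|_2^2 \le (1 + O(\epsilon)) \mu^2.$$

This is analogous to the proof of Count-Sketch, or to
Corollary 4.1. Note that

$$\|w_{[t] \setminus \hat{U}}\|_2^2 = \|w_{U \setminus \hat{U}}\|_2^2 + \|w_{[t] \setminus (U \cup \hat{U})}\|_2^2.$$

For any $i \in \hat{U} \setminus U$ and $j \in U \setminus \hat{U}$, we have $z_i \ge z_j$, so

$$w_j - w_i \le O\!\left(\frac{\epsilon}{\sqrt{s}} \mu + \epsilon w_j\right).$$

Let $a = \max_{j \in U \setminus \hat{U}} w_j$ and $b = \min_{i \in \hat{U} \setminus U} w_i$. Then
$a \le b + O(\frac{\epsilon}{\sqrt{s}} \mu + \epsilon a)$, and dividing by $(1 - O(\epsilon))$ we
get $a \le b(1 + O(\epsilon)) + O(\frac{\epsilon}{\sqrt{s}} \mu)$. Furthermore
$\|w_{\hat{U} \setminus U}\|_2 \ge b \sqrt{|\hat{U} \setminus U|}$, so

$$\begin{aligned}
\|w_{U \setminus \hat{U}}\|_2 &\le a \sqrt{|U \setminus \hat{U}|} \le \|w_{\hat{U} \setminus U}\|_2 (1 + O(\epsilon)) + O(\epsilon \mu), \\
\|w_{U \setminus \hat{U}}\|_2^2 &\le \|w_{\hat{U} \setminus U}\|_2^2 (1 + O(\epsilon)) + (2 + O(\epsilon)) \|w_{\hat{U} \setminus U}\|_2\, O(\epsilon \mu) + O(\epsilon^2 \mu^2) \\
&\le \|w_{\hat{U} \setminus U}\|_2^2 + O(\epsilon \mu^2)
\end{aligned}$$

because $\|w_{\hat{U} \setminus U}\|_2 \le \mu$. Thus

$$\begin{aligned}
\|w_{[t] \setminus \hat{U}}\|_2^2 &\le O(\epsilon \mu^2) + \|w_{\hat{U} \setminus U}\|_2^2 + \|w_{[t] \setminus (U \cup \hat{U})}\|_2^2 \\
&= O(\epsilon \mu^2) + \mu^2 = (1 + O(\epsilon)) \mu^2.
\end{aligned}$$

This is exactly what we want. If $S = \bigcup_{i \in \hat{U}} T_i$ contains
the blocks corresponding to $\hat{U}$, then

$$\|x - x_S\|_2 = \|w_{[t] \setminus \hat{U}}\|_2 \le (1 + O(\epsilon)) \mu = (1 + O(\epsilon))\, \mathrm{Err}_2(x, k, b).$$

Rescale $\epsilon$ to change $1 + O(\epsilon)$ into $1 + \epsilon$ and we're done. □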






